Scraping 101

Web scraping is the process of fetching unstructured or semi-structured data from the web, parsing it into a structured form and (often) saving it to a local file for later use.

Scraping is all about finding the patterns in the information displayed on the page.

A few best practices

  • Scraping can be brittle and error-prone -- it should be your last resort
    • Ask for the data first
    • If the page is grabbing and parsing a data file to build the page, intercept it! (ABI: Always Be Inspectin' (the network traffic)) Here's an example.
  • Don't DoS (denial-of-service) your target site -- pause between HTTP requests
  • Work on a cached version of the page while you're fiddling with your script
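
The last two tips can be combined in one small helper. This is only a rough sketch, not a finished tool: the function name get_page, the cache file name and the two-second default pause are all made up for illustration. It reads the page from a local file if one exists, and only goes out to the network (and then pauses) when it has to.

```python
import time
from pathlib import Path


def get_page(url, cache_file, pause=2.0):
    """Return a page's HTML, using a local cached copy when one exists."""
    cache = Path(cache_file)
    if cache.exists():
        # Cached copy found: no HTTP request, so no need to pause.
        return cache.read_text(encoding="utf-8")

    # Only needed when we actually go out to the network.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on a 404, 500, etc.
    cache.write_text(response.text, encoding="utf-8")

    # Be polite: wait before the caller can fire off the next request.
    time.sleep(pause)
    return response.text
```

You'd call it like get_page("https://example.com/some-page", "some-page.html") (a placeholder URL). The first run hits the site once; every run after that works off the saved file while you fiddle with your parsing code.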

Python modules we'll be using

  • requests for HTTP requests (fetching a page's contents)
  • BeautifulSoup for parsing the page's contents into something that Python can work with
  • csv for writing the data to a file
  • time for pausing between requests
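
As a quick taste of the csv piece, here's a minimal sketch of writing rows to a file. The function name write_rows and the column names in the usage note are placeholders, not data from any real page.

```python
import csv


def write_rows(path, header, rows):
    """Write a header line plus a list of rows to a CSV file."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

For example, write_rows("scraped.csv", ["name", "county"], [["Jane Doe", "Travis"], ["John Roe", "Harris"]]) would produce a three-line file: one header line and two data lines.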

HTML: How web pages are made

To scrape a web page, you need a basic understanding of how a web page is built.

Web pages are text files, basically, written in something called HTML (Hypertext Markup Language). HTML elements are represented (usually) by a pair of tags -- an opening tag and a closing tag.

A table, for example, starts with <table> and ends with </table>. The opening tag tells the browser: "Hey! I got a table here! Render it as a table." The closing tag (note the forward slash!) tells the browser: "Hey! I'm all done with that table, thanks." Nested inside the table are more HTML tags representing rows (<tr>) and cells (<td>).
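
To see the opening-tag/closing-tag idea in action, here's a small sketch that walks through a made-up table and reacts to each tag as it goes by. It uses html.parser from Python's standard library rather than BeautifulSoup, purely so it runs without installing anything. The sample HTML and the names TableReader and read_table are invented for this illustration, and it assumes tidy markup where every <td> sits inside a <tr>.

```python
from html.parser import HTMLParser

# A made-up table, just to show the nesting of table > tr > td.
HTML = """
<table id="example">
  <tr><td>Apple</td><td>Red</td></tr>
  <tr><td>Banana</td><td>Yellow</td></tr>
</table>
"""


class TableReader(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # opening <tr>: start a new row
        elif tag == "td":
            self.in_cell = True         # opening <td>: start a new cell
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False        # closing </td>: the cell is done

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data   # text between <td> and </td>


def read_table(html):
    reader = TableReader()
    reader.feed(html)
    return reader.rows
```

Calling read_table(HTML) returns [['Apple', 'Red'], ['Banana', 'Yellow']]: one list per row, one string per cell.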

There's a lot more to it, but that's probably good for now.

Inspect the source!

If I'm thinking about scraping a page, the first thing I do is look at the HTML code that makes up the page. You can do this right from your browser -- I like to use Chrome but Firefox has some good developer tools, as well. (Maybe IE does too, who knows lol)

To "view source" in Chrome, hit Ctrl+U on a PC or Cmd+Opt+U on a Mac. It's also in the menu bar: View -> Developer -> View Page Source.

You'll get a page showing you all the HTML code that makes up that page. Locate the element(s) that you want to target and note the structure.

You can also inspect a specific element on the page by right-clicking it and selecting "Inspect" or "Inspect Element" from the context menu that pops up.

Let's inspect the source on the Texas death row offender page, which we're going to scrape later.

  • View source
  • Ctrl+F to search for the table with the information we want.
  • Is it the only table on the page?
  • Take note of the table's attributes.
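
Once you have the page's HTML, you can ask the same questions in code. Here's a rough sketch using BeautifulSoup. The HTML below is made up (two tables with invented id and class values), so the real page's markup will differ, and list_tables is just a name picked for this example.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables; the real page's markup will differ.
HTML = """
<table class="nav"><tr><td>Home</td></tr></table>
<table id="results" class="data"><tr><td>Row 1</td></tr></table>
"""


def list_tables(html):
    """Return the attributes of every <table> in the HTML, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    return [table.attrs for table in soup.find_all("table")]
```

If list_tables(HTML) gives you back more than one entry, the table isn't alone on the page, and the attributes (an id, a class) are what you'd use to pick out the right one, e.g. soup.find("table", id="results") for the sample above.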

Exercise on your own

Inspect the source of this page of certified lead burn instructors in Texas. Find the table of instructors. Is it the only table on the page? Take note of your target table's attributes.